% Chapter about application of terminological shadows



%\documentclass[oneside]{oxthesis}

%\begin{document}
%\addtocounter{chapter}{6}


\chapter{Applications of terminological shadows}
\label{ch:apply}


The two-level modelling paradigm for EHRs has matured considerably over the last decade. The advent of
two-level model based EHR standards such as EN13606 and openEHR has changed the landscape of information
communication between clinical information systems. Clinical archetypes, the core innovation of the
two-level approach, have
influenced and inspired many advances in modern EHR systems. Meanwhile, clinical
terminologies, which are medical vocabularies and representations of medical knowledge,
play a key role in the move towards semantic interoperability. The integration of two-level
EHRs and clinical terminologies is widely seen as essential to delivering better healthcare
services through a more connected e-health environment. The terminological shadow approach developed in this
thesis provides a new way of looking at the integration problem. This chapter presents
two applications that demonstrate how terminological shadows can support clinical information
modelling with archetypes and SNOMED-CT. Specifically,
terminological shadows have been applied to
\begin{enumerate}
  \item Discover the clinical coverage of existing archetypes. The study generates 
    important findings that may interest clinical archetype modellers and 
    SNOMED-CT terminology modellers. 
  \item Measure the semantic similarity of two archetypes. This experiment attempts to compare
    archetypes to find closely related clinical content.
\end{enumerate}    
These applications extend the evaluation of
terminological shadows described in Chapter \ref{ch:result} and provide guidelines for those
who wish to use terminological shadows to improve the integration between EHR information models
and clinical terminologies.







% ----------------------------------------
% paper 2 stuff
% ----------------------------------------



\section{Application 1: The discovery of clinical coverage of an archetype repository over SNOMED-CT}
\label{sec:paper2}
To demonstrate that terminological shadows are flexible artefacts that can facilitate the integration of
EHR models and clinical terminologies, this work determines the clinical concept coverage of
an archetype repository by analysing the terminological shadows generated from that
repository. This section presents an example of how the shadow approach can contribute to
information modelling in EHRs and terminologies. Clinical coverage refers to the clinical content
covered by existing archetypes, measured in terms of the SNOMED-CT concepts in their
terminological shadows. The definition of clinical coverage and the method used to obtain the coverage
results are discussed in the following sections.


\subsection{Motivation}
% TODO: This experiment deals with the coverage of clinical content, the scope of the work 
% is limited to the following aspects - .. The reason for scope is ..
\label{sec:paper2-motivation}
It was noted in sections \ref{sec:cen13606} and \ref{sec:openehr} that
the specifications released by the openEHR foundation and as part of the CEN/ISO EN13606 standard
for EHR communication \parencite{open13606rel,dogac2005ehrstdsurvey} define, among other things, an archetype model
\parencite{openehr2007aom}. These specifications describe how to create archetypes that express constraints on clinical
information in an EHR by using a formal language called the Archetype Definition Language (ADL). Archetypes are
information modelling artefacts used in advanced e-health modelling methodologies, and their
number is growing steadily to cover many clinical specialties. As noted in section
\ref{sec:repo}, many repositories of archetypes have been created for various projects around the
world. To accommodate this growth,
web-based archetype authoring and management platforms called
\emph{Clinical Knowledge Manager} (CKM) have been
developed to maintain the repositories of archetypes \parencite{garde2006towards}.
Although different repositories have evolved to respond to differing needs,
the archetypes in the CKM repository have been adopted on a relatively large scale.
CKM instances
may also be able to organise and categorise archetypes under certain classifications.

SNOMED-CT \parencite{sno2008manual}, in contrast (or, rather, as a complement) to the contents of an archetype
repository, is intended to facilitate semantic interoperability by providing a comprehensive set of
commonly understood clinical concepts. If SNOMED-CT is to deliver on this objective, it will
increasingly cover the space of clinical findings, diagnoses, anatomy, drugs, and physical objects
that relate to healthcare activity. With its large scope, SNOMED-CT is a powerful external
reference in the medical domain that enables communicating parties to convey
unambiguous clinical concepts.

The completeness of clinical concept modelling in each medical domain
provides health professionals with a sufficient number of concepts to express clinical statements in
many clinical scenarios \parencite{snocompl2}. SNOMED-CT can therefore be considered an appropriate
yardstick for measuring the clinical concept coverage of an archetype repository. However,
at the time of writing, repositories of archetypes encapsulate clinical knowledge but do
not show how much of the knowledge contained in terminological systems such as SNOMED-CT
they cover. Little published work reports the
clinical content coverage of archetypes with respect to clinical terminology
systems or the distribution of clinical concepts within them, nor does the
literature describe in detail a method for obtaining coverage information of this type. Such information would
help to identify clinical territory where archetypes provide insufficient support
and to pinpoint areas where more archetype development work is required.


Many archetype terms are intended to link to universally understood medical references, such as
SNOMED-CT codes, so that when sharing archetypes, the meaning can be conveyed to the communicating
parties. This use of external coding systems has the potential to provide a variety of means to
classify and categorise a large number of archetypes. However, not all archetype terms are linked or
linkable with external terminological references. Nevertheless, the presence of two large and
related information resources, an archetype repository and a terminological system, presents an
opportunity to investigate how clinical information is modelled in contemporary repositories and to
identify the relative level of coverage for different categories of clinical terms. Using an
available resource of tens of thousands of archetype nodes and the information that they represent,
this application seeks a terminological way of classifying them while gaining an understanding
of the distribution of clinical concepts within an archetype repository.

The primary focus of this investigation is to identify the coverage of SNOMED-CT concepts in an archetype
repository. By mapping terms that are defined internally in archetypes to standard SNOMED-CT
concepts, it is possible to obtain an approximate overview of the equivalent terminological content
of the archetype repository. The motivation of this application is to facilitate the linking and
harmonisation of two modelling approaches in health information systems. The archetype model is
representative of clinical meta-data modelling methods that express the commonly agreed clinical
contents of EHRs. SNOMED-CT, on the other hand, is representative of clinical terminologies and
models clinical knowledge using an ontological approach. Both modelling approaches are designed to
work with EHR systems for different purposes, but gaps in coverage, overlaps and similarities are
likely between the two approaches. As discussed in \textcite{Qamar2008, bisbal2009arch-align}, increasing
integration of the two approaches will facilitate semantic interoperability in e-health.

It is hoped that the results will allow both communities (archetype modellers and terminologists) to
address interoperability issues that arise due to embedded codes. In addition, it is hoped that the
outcomes of this research will allow the identification of which categories in SNOMED-CT are more
thoroughly covered by modelling artifacts like archetypes, and which ones still require more
attention. These results should help in focusing the significant modelling efforts undertaken by the
community.



%\subsubsection{Term binding in ADL}
%
%Archetypes, even when not referring to clinical terminologies, may contain tens if not hundreds of
%freely designed data nodes that express clinical meanings. %DAMON ref for this statement As part of
%the archetype specification, the syntax of the ADL language includes a mechanism to allow annotation
%of clinical concepts in archetypes by defining local terms. These local terms are specified as ``AT
%codes'', where the `AT' stands for `Archetype Term'. A dedicated section is provided in each
%archetype to expand the explicit meaning of these terms and occasionally a ``presentation name'' for
%display on a screen. Archetype terms can also be linked to terms in external terminologies such as
%SNOMED-CT, known as term binding in the ADL syntax. These bindings to local `AT' coded terms in
%archetypes can be used to retrieve a commonly understood medical definition. The following example
%shows a snippet of such ADL syntax where the locally defined code ``at0021'' is linked to SNOMED-CT
%code 162465004.
%\begin{verbatim}
%...
%ELEMENT[at0021] occurrences matches {0..1} matches {--Severity
%	value matches {
%			1|[local::at0044], 	-- trivial 
%			2|[local::at0023], 	-- mild
%			5|[local::at0024], 	-- moderate
%			8|[local::at0025], 	-- severe
%			9|[local::at0045]  	-- very severe
%				}
%...	
%term_bindings = <	
%	["SNOMED-CT"] = <	
%		  items = <		
%		  ["at0021"] = <[SNOMED-CT::162465004]>
%...		  
%\end{verbatim}
%

\subsection{Top categories of SNOMED-CT}
\label{sec:categ-sno}
To support the coverage assessment in this application, the hierarchical structure of SNOMED-CT,
already described in section \ref{sec:ontosno}, is adopted as the basis of the coverage calculation.
Figure \ref{SNO} shows the top levels of the SNOMED-CT hierarchy in more detail.
`SNOMED-CT' is the root node of all concepts, and its 19 first level categories in the concept model
range from \emph{body structure} to \emph{physical object}. Each category represents an abstract
clinical classification that is sub-classified in turn into second level categories such as
``Finding by site'' and ``Disease''. Second level categories are themselves sub-classified into
more specific clinical categories, and this structure continues further down the
hierarchy until the most specific and detailed concepts are reached. The SNOMED-CT concept model allows
multiple inheritance, which means that a concept may belong to more than one parent category. In
Figure \ref{SNO}, the node `Disorder by body site' is a child of both `Finding by site' and
`Disease'.

\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=\linewidth]{../res/SNO_stru}
\end{center}
\caption{SNOMED-CT top categories and multiple inheritance}
\label{SNO}
\end{figure}


This investigation also looked at the number of concepts in each of the first and second level
categories.
Table \ref{sno_report2} lists, in descending order of size, all 19 first level categories of the
SNOMED-CT release that was used in the study, where size is the total number
of concepts under a first level category. All concepts have
been indexed and stored in a database for the mapping process. These figures are used to calculate
the coverage of SNOMED-CT concepts by archetypes in the following sections.


\begin{table}\footnotesize
\begin{center}
\begin{tabular}{ l  l }
\textbf{First level} & \textbf{Number of}\\
\textbf{category name} & \textbf{concepts (size)}\\
\hline
Clinical finding (finding) &    109311\\
Special concept (special concept)&      67342\\
Procedure (procedure)&  53854\\
Body structure (body structure) &       31837\\
Organism (organism)&    27952\\
Substance (substance)&  23456\\
Pharmaceutical / biologic product (product)&    19084\\
Qualifier value (qualifier value)&      8904\\
Event (event)&  8447\\
Observable entity (observable entity)&  7834\\
Social context (social concept)&        5252\\
Situation with explicit context (situation)&    4912\\
Physical object (physical object)&      4515\\
Environment or geographical location (environment / location)&  1741\\
Linkage concept (linkage concept)&      1136\\
Staging and scales (staging scale)&     1113\\
Specimen (specimen)&    1055\\
Record artifact (record artifact)&      202\\
Physical force (physical force)&        172\\
Total concepts: & \textbf{378111}\\
\hline
SNOMED-CT version: & Jan 2008\\
\end{tabular}
\end{center}
\caption{First level categories with the number of concepts}
\label{sno_report2}
\end{table}





\subsection{Method}
\label{method}
This section describes how to apply the terminological shadow technique to a whole archetype
repository and map nodes in archetypes to SNOMED-CT codes to obtain an overview of clinical coverage.
As mentioned previously, the terminological shadow framework uses Lucene, a vector-space based indexing tool,
to map archetype terms to SNOMED-CT concepts automatically
without using an external thesaurus. In this
application the framework was used to generate shadows for a whole archetype repository.
A terminological shadow was defined
in chapter 4 as an artefact that contains the SNOMED-CT concepts with which a single archetype or set
of archetypes may be associated. It is the result of mapping archetype nodes to their equivalent
terminological references, for example SNOMED-CT concepts, to represent the semantics of an
archetype in the language of clinical concepts. Together with context information such as the
clinical purpose and data type of each node in the archetype, the shadow approach is applicable to several semantic
tasks on archetypes. Shadows, as the end product of a mapping process that in this case is applied
to all archetypes in a repository, can help to determine how many clinical concepts are
covered by the archetype modellers.








The main focus of this part of the work is to identify the coverage and the distribution of the
first and second level SNOMED-CT categories in an archetype repository. For the purposes of this application, \emph{coverage}
is defined as the ratio of the number of mapped concepts in a category to the number of concepts in the entire
category, expressed as a percentage. The rationale is to study how fully the growing corpus of
archetypes models the clinical domain and to identify apparent gaps in archetype sets, which should
drive further research and modelling efforts. `Shadows' contain information on all archetype nodes
as well as links to the corresponding SNOMED-CT concepts and to the SNOMED-CT categories to
which they belong. A quantitative study can therefore be performed to reveal the coverage of
clinical content in SNOMED-CT space. The method of calculating the coverage of SNOMED-CT categories in an archetype repository involves

\begin{enumerate}
\item extracting text from archetype nodes,
\item mapping these nodes to SNOMED-CT codes,
\item reporting the categories of the resulting codes in SNOMED-CT, and
\item computing the coverage.
\end{enumerate}
The following subsections describe each step in detail.




\subsubsection{Extraction of archetype terms}

To prepare the coverage calculation, archetypes from the
NHS repository \parencite{nhs_repo} were used. This repository was first introduced in
section \ref{sec:repo} in chapter 2.
It was chosen because it contains a large number of archetypes that are designed for
different clinical purposes and span a diversity of
clinical disciplines. The clinical content of the NHS archetypes is therefore expected to
cover a wide area of SNOMED-CT.
The archetypes are available in the public domain and have undergone broad internal
review by expert clinicians prior to being approved for NHS usage as exemplar clinical models
\parencite{nhs_review}. Table \ref{nhs_report} lists the number of archetypes per reference
model type in the NHS repository.

\begin{table}\footnotesize
\begin{center}
\begin{tabular}{ l  l }
 Reference model concepts & Number of archetypes based on this concept\\
\hline
cluster: & 300\\
composition: & 18\\
element: & 5\\
entry: & 557\\
section: & 96\\
structure: & 99\\
\hline
Total archetypes: & 1075\\
\end{tabular}
\end{center}
\caption{Breakdown of archetypes by reference model type in the NHS-CfH archetype repository}
\label{nhs_report}
\end{table}


The pre-processing of archetypes involves taking the latest version of each archetype and resolving
inheritance to remove redundant nodes that repeat the base nodes of their parent archetypes. The
resulting output comprises a list of unique archetype terms which represent the clinical
information in the repository as a whole. As listed in
Table \ref{nhs_report2}, the extraction yields 8362 distinct archetype
terms\footnote{Of the 8362 archetype terms that were extracted, 7925 mapped directly to
SNOMED-CT concepts, while no corresponding SNOMED-CT concept was found for
the remaining 437 archetype terms.}. There are 4982 uniquely mapped SNOMED-CT concepts when
duplicates are removed.


\begin{table}\footnotesize
\begin{center}
\begin{tabular}{ l  l }
Result of extraction and mapping process & \# terms found\\
\hline
unique AT codes: & 8362\\
mapped unique concepts: & 4982\\

\end{tabular}
\end{center}
\caption{Number of unique AT codes and uniquely mapped SNOMED-CT concepts in the NHS-CfH archetype repository}
\label{nhs_report2}
\end{table}
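
As a rough worked reading of these figures (simple arithmetic on the counts reported in this section, rounded to one decimal place), the proportion of extracted archetype terms that received a mapping and the average number of mapped terms per unique SNOMED-CT concept are
\[
\frac{7925}{8362} \approx 94.8\%, \qquad \frac{7925}{4982} \approx 1.6 .
\]
In other words, roughly 95 in every 100 archetype terms were mapped, and on average about 1.6 mapped terms point to the same SNOMED-CT concept, which is why the number of unique concepts is markedly smaller than the number of mapped terms.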







\subsubsection{Mapping to SNOMED-CT}

This step takes advantage of the terminological `Shadow' approach, which in this case is used to map every archetype term automatically
to a single SNOMED-CT concept. The second level categories of the mapped concepts are then fetched to obtain the coverage information. One way
of visualising the `Shadowing' operation is to
`project' archetype terms onto SNOMED-CT to see which category they belong to.
Figure \ref{shadow} resembles Figure \ref{shadow_abstr} in Chapter
\ref{sec:sno-ehrcontext}, which introduced the `Terminological Shadow' concept. Here Figure
\ref{shadow} illustrates how the shadow approach can be used to estimate the clinical content coverage
of an archetype repository. By `projecting' archetype terms onto SNOMED-CT space and looking up the
second level categories in SNOMED-CT, the coverage of the clinical concepts in archetypes can be
obtained.

Second level categories were chosen as the focus of the coverage calculations because of the granularity of SNOMED-CT. As shown in
Table \ref{sno_report2}, many of the 19 first level categories are very large and too abstract to be useful in categorising the terms
of single archetypes. The 345 second level categories indicated in Figure \ref{SNO}, on the other hand, while still very
general, are in the author's view sufficiently specific to represent useful divisions of clinical topics corresponding to archetype information.



\begin{figure}[!htbp]
\begin{center}
\includegraphics[scale=0.7]{shadow}
\end{center}
\caption{Calculating clinical content coverage of archetypes by creating terminological shadows}
\label{shadow}
\end{figure}



The algorithm used for the mapping process has been described in previously published work
by the author \parencite{syu2010}. It uses the vector-space
search tool Lucene to find, for the text string of each unique archetype term, the most relevant concept among all indexed SNOMED-CT concepts. The steps in the algorithm are as follows:
\begin{enumerate}
\item For each archetype in the repository, extract the text of all `AT' codes and, for each text string, search for associated SNOMED-CT concepts.
\item The search normally returns a list of candidate SNOMED-CT codes. Rank these codes so that those considered the best answers are at the top of the list.
\item Select the top answer from the list for each archetype term in an archetype to produce its terminological `Shadow'.
\item After the `Shadows' of all archetypes have been constructed, look up the first and second level categories of every mapped SNOMED-CT concept
in its hierarchy.
\end{enumerate}
By tracing this hierarchy, the second level category below the root is stored as the category of the mapped archetype
term. For example, if the text string `Crutches' is mapped to the SNOMED-CT concept `Crutches (physical object)', its second level ancestor is
\emph{Device (physical object)}. The count of uniquely mapped concepts for that category is incremented by 1, and so on for every other mapped archetype
term. By the end of the process, figures for all `AT' terms and their mapped second level categories are stored and ready for the
computation of coverage values.
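
The category lookup in the last step can be stated a little more formally. The notation below is introduced here purely for illustration and is not taken from the SNOMED-CT documentation. Write an \emph{is-a} path from the root to a mapped concept $c$ as
\[
r = a_0 \rightarrow a_1 \rightarrow a_2 \rightarrow \cdots \rightarrow a_k = c, \qquad k \geq 2,
\]
where $r$ is the root node `SNOMED-CT' and each $a_{i+1}$ is a direct child of $a_i$. Then $a_1$ is the first level category of $c$ and $a_2$ is its second level category. For the `Crutches' example the path has the form
\[
\mbox{SNOMED-CT} \rightarrow \mbox{Physical object} \rightarrow \mbox{Device} \rightarrow \cdots \rightarrow \mbox{Crutches},
\]
so that $a_1$ is \emph{Physical object} and $a_2$ is \emph{Device}. Because SNOMED-CT allows multiple inheritance, a concept can have more than one such path and therefore more than one candidate for $a_2$. Any rule for resolving such cases (for instance, counting the concept once under each candidate category) should be read as an assumption of this sketch rather than as a description of the implemented algorithm.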



Occasionally the mapping process encounters an inactive concept that is nevertheless suggested as a mapping.
This is due to the life cycle of
SNOMED-CT concept modelling: certain concepts are marked `inactive' but are not deleted. The algorithm then replaces the inactive concept
with the next most relevant concept on the returned list that is not inactive.
Although the mapped SNOMED-CT concepts only approximate the original archetype terms,
because the mapping is automated rather than performed by a human, their second level categories can still reveal the clinical content covered by an archetype
repository.



\subsubsection{Definition of SNOMED-CT coverage}

Using the data gathered from the `Shadows', the coverage of a SNOMED-CT category in an archetype repository is defined as the ratio of the
uniquely mapped SNOMED-CT concepts found in that category to the size of the category, expressed as a percentage. In this study it is computed for second level categories. Figure \ref{shadow} illustrates
that the coverage is the ratio of `Shadowed' concepts in a second level category, represented by the dashed deep blue area, to the total number of
concepts in this category, represented by the dashed blue circle. \textbf{Equation \ref{eq1}} shows how to find the coverage of a particular category \emph{T}.

\begin{equation} \label{eq1}
\mathit{coverage}(T) = \frac{\textit{mapped SNOMED-CT concepts in T}}{\textit{Size of T}} \times 100\%
\end{equation}
	\myequations{Equation number \ref{eq1}}

\begin{itemize}
	\item where \textit{mapped SNOMED-CT concepts in T} is the number of uniquely mapped SNOMED-CT concepts belonging to category \emph{T},

	\item and \textit{Size of T} is the total number of concepts that category \emph{T} contains.
\end{itemize}
This measure establishes which of the categories containing mapped SNOMED-CT concepts exhibit higher coverage than others.
After the categories of all the mapped concepts have been established, the coverage of every second level category is recorded.
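
As a worked illustration, Equation \ref{eq1} can be applied to two of the second level categories reported in Table \ref{bigt}, with the ratio expressed as a percentage and rounded to one decimal place:
\[
\mathit{coverage}(\mbox{Attribute}) = \frac{141}{1127} \times 100\% \approx 12.5\%,
\]
\[
\mathit{coverage}(\mbox{Finding by site}) = \frac{432}{53966} \times 100\% \approx 0.8\%.
\]
\emph{Finding by site} has roughly three times as many uniquely mapped concepts as \emph{Attribute}, yet its coverage is far lower because the category is nearly fifty times larger. A high count of mapped concepts therefore does not by itself imply high coverage.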





\subsection{Related work}
\label{relwk} 
%Due to the textual nature of clinical information, association of medical text and
%codes in a standard terminology usually involves complex mapping processes.  Many existing research
%work involves the mapping of arbitrary clinical text with standard terminologies, which can be
%considered as a generalised case of archetype terms and SNOMED-CT concept mapping, whether manually
%or automatically. Related research work can be broken into three major categories with slightly
%different aims.  



Besides the related work presented in chapter 3, this section discusses research that is
specifically relevant to, and compatible but not overlapping with, the aspects of this
thesis that relate to the
discovery of the clinical coverage of archetypes.
Lezcano et al.~used
a lexical tool from the UMLS thesaurus, which stores pre-mapped arbitrary text and UMLS concepts in an
index, to associate archetypes with UMLS concepts. The resulting clusters of UMLS concepts were used
to build a graph to facilitate the semantic classification of archetypes \parencite{umls2010}. Lezcano
noted that many applications could be derived from the graph model approach, such as the enhancement
of browsing and navigation of archetype repositories. It was concluded that the technique
can be applied to any number of archetypes to classify them automatically into
semantic clusters. However, the `intersections' of archetypes depend largely on the common
terms that are used in different archetypes. Their focus therefore remains on identifying common semantic
content among archetypes and analysing it using graph techniques.

The major difference between the related research presented above and the application presented in
this thesis is that this
research aims to determine the distribution of SNOMED-CT clinical concepts in an archetype
repository. The mapping method does not rely on a pre-mapped thesaurus or on medical text processing
software. Because it is an algorithmic method, other clinical terminologies can also be
investigated.







\subsection{Results}
\label{res}
The coverage results for the first and second level categories are presented and interpreted separately in this section. The mapped archetype terms in the repository fall into 186 of the 345 SNOMED-CT second level categories. To classify all the mapped SNOMED-CT concepts and see how their original archetype terms cover the first level SNOMED-CT categories, Figure \ref{pie} presents an overview of the mapped SNOMED-CT concepts from the perspective of first level categories. The pie chart shows how the continuum of mapped categories breaks into relatively rare, moderately popular and popular parts. Table \ref{bigt} shows part of the results, with details of each mapped second level category and its uniquely mapped concepts, size and coverage.



\subsubsection{First level overview}
\label{sec:1st-lvl}
Figure \ref{pie} gives an overview of the distribution of first level categories in the archetype repository. Each section of the chart shows
the quantity of mapped concepts in one first level category. The first number following the name of the category is the number of mapped concepts,
including duplicates, that belong to this category\footnote{The percentage is the number of mapped terms in the current category divided by
the total number of mapped terms, which is 7925.}.



From the pie chart it can be seen that the majority of the mapped concepts are from
\emph{Clinical finding}, which implies that most of the information in archetypes covers this area.
This makes sense, since most of the archetypes in the repository tend to contain constraints on clinical
findings and the \emph{Clinical finding} category is the most populous, as shown in Table
\ref{sno_report2}. The second most popular category is \emph{Qualifier value}, which
makes up one fifth of the total number of mapped concepts. However, referring to Table
\ref{sno_report2}, this category is not among the biggest in SNOMED-CT, with a total of 8904
concepts. Smaller but popular categories such as \emph{Qualifier value} will be discussed later.
The third largest group of mapped concepts belongs to \emph{Procedure}, and the remainder are
contained in \emph{Observable entity}, \emph{Body structure} and so on as the number of mapped
concepts per category declines.
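
The percentages in the pie chart are shares of the mapped terms, not coverage values in the sense of Equation \ref{eq1}. Using notation introduced here only for illustration, with $n_C$ the number of mapped terms (duplicates included) in first level category $C$ and $N = 7925$ the total number of mapped terms, the share is
\[
\mathit{share}(C) = \frac{n_C}{N} \times 100\% .
\]
For \emph{Qualifier value}, a share of about one fifth corresponds to roughly $0.2 \times 7925 \approx 1585$ mapped terms. This is an approximation derived from the rounded fraction quoted in the text, and the exact count is the one shown in Figure \ref{pie}. Unlike coverage, the share does not depend on the size of the category in SNOMED-CT, which is why a comparatively small category can still account for a large part of the chart.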

% todo -- Refer to this section in the ``uneven distribution of SNOMED CT codes'' remark



\begin{figure}[!htbp]

\begin{center}

\includegraphics[scale=0.6]{fig1}

\end{center}

\caption{Distribution of mapped concepts among first level categories}

\label{pie}

\end{figure}







\subsubsection{Second level coverage}

From the perspective of the second level categories of SNOMED-CT, all 7925 concepts that were mapped during this study fall into 186 categories.
Of the 345 second level categories in total, those to which no archetype term is mapped represent areas that the repository does not cover.
Notable among these is 14833006 Cardiovascular drug (product), which with a size of 926 appears to be the largest SNOMED-CT second level category that is
not covered and is therefore absent from Table \ref{bigt}. Among the mapped categories, some are very rare, such as
\emph{Geographical and/or political region} and \emph{Radiation therapy observable}, each of which has only one uniquely mapped concept.
They represent the minority of mapped second level categories of the archetype repository, in contrast to the most popular category, which is
\emph{Clinical history and observation findings}. For each category, Table \ref{bigt} lists the number of uniquely mapped concepts,
the category size (the total number of concepts under the category) and the coverage calculated using \textbf{Equation \ref{eq1}}. All second level categories are grouped by first level category and sorted in descending
order of coverage within each first level category. Table \ref{bigt} shows some of the results of the study, focusing on those categories with a
minimum size of 500 concepts\footnote{The reason for presenting this subset is to focus on categories that have a relatively large number of concepts and to
investigate their coverage in the repository.
The complete results, including the results for smaller categories and their coverage, can be
found at \url{http://www.ehrland.ie/down_demo.html}.}. The
contrast between the category sizes and the coverage can prompt investigation of particular categories by interested experts.

\begin{center}
\footnotesize
\begin{longtable}{ l l l l l l}



\multicolumn{2}{l}{\emph{SNOMED-CT first and second}} & \emph{\# Uniquely mapped} & \emph{Category} & \emph{Coverage} \\

\multicolumn{2}{l}{\emph{level category name}} & \emph{SNOMED-CT concepts} & \emph{size} & \emph{(\%)} \\
\hline 

\multicolumn{2}{l}{a. \textbf{Body structure}} \\

	& 	1. Body structure, altered from its  	& 	55 	& 	4671 	& 	1.177\\

	& 	{original anatomical structure } 	& 		& 	 	& 	\\
	&	{(morphologic abnormality)}	& & & \\
	& 	2. Physical anatomical entity (body structure) 	& 	229 	& 	27005 	& 	0.847\\

	



\multicolumn{2}{l}{b. \textbf{Clinical finding}} \\

	& 	1. General clinical state finding (finding) 	& 	32 	& 	642 	& 	4.984\\

	& 	2. Administrative statuses (finding) 	& 	114 	& 	2404 	& 	4.742\\

	& 	3. Clinical history and observation findings (finding) 	& 	699 	& 	16319 	& 	4.283\\

	& 	4. Finding by method (finding) 	& 	157 	& 	4430 	& 	3.544\\

	& 	5. Neurological finding (finding) 	& 	61 	& 	1904 	& 	3.203\\

	& 	6. Finding by site (finding) 	& 	432 	& 	53966 	& 	0.801\\

	& 	7. Disease (disorder) 	& 	174 	& 	25287 	& 	0.688\\

	& 	8. Wound finding (finding) 	& 	13 	& 	2643 	& 	0.492\\



\multicolumn{2}{l}{c. \textbf{Environment or geographical location}} \\



	& 	1. Environment (environment) 	& 	82  	& 	1133 	& 	7.237\\

	& 	2. Geographical and/or political region   	& 	1 	& 	607 	& 	0.165\\
	&	{(geographic location)} & & & \\


\multicolumn{2}{l}{d. \textbf{Event}} \\

	& 	1. Accidental event (event) 	& 	9 	& 	5106 	& 	0.176\\

	& 	2. Exposure to potentially harmful entity (event) 	& 	2 	& 	2046 	& 	0.0978\\



\multicolumn{2}{l}{e. \textbf{Linkage concept}} \\ 

	& 	1. Attribute (attribute) 	& 	141 	& 	1127 	& 	12.511\\



\multicolumn{2}{l}{f. \textbf{Observable entity}} \\

	& 	1. Function (observable entity) 	& 	84 	& 	1404 	& 	5.983\\

	& 	2. Clinical history/examination observable  	& 	205 	& 	3937 	& 	5.207\\
	&	{(observable entity)}	& & & \\
	& 	3. Feature of entity (observable entity) 	& 	22 	& 	773 	& 	2.846\\



\multicolumn{2}{l}{g. \textbf{Organism}} \\

	& 	1. Trophic life form (organism) 	& 	1  	& 	503 	& 	0.199\\

	& 	2. Kingdom Animalia (organism) 	& 	24 	& 	13256 	& 	0.181\\

	& 	3. Kingdom Plantae (organism) 	& 	3 	& 	1935 	& 	0.155\\

	& 	4. Pathogenic organism (organism) 	& 	1  	& 	678 	& 	0.147\\

	& 	5. Microorganism (organism) 	& 	9  	& 	11413 	& 	0.0789\\



\multicolumn{2}{l}{h. \textbf{Pharmaceutical / biologic product}}  \\

	& 	1. Veterinary proprietary drug AND/OR  	& 	18 	& 	2533 	& 	0.711\\
	&	{biological (product)}	& & & \\
	& 	2. Replacement preparation (product) 	& 	3  	& 	576 	& 	0.521\\

	& 	3. Analgesic (product) 	& 	4 	& 	981 	& 	0.408\\

	& 	4. Hematologic drug (product) 	& 	2  	& 	510 	& 	0.392\\

	& 	5. Autonomic drug (product) 	& 	2  	& 	546 	& 	0.366\\

	& 	6. Biological agent (product) 	& 	2 	& 	829 	& 	0.241\\

	& 	7. Hormones, synthetic substitutes and  	& 	2 	& 	1271 	& 	0.157\\
	&	{antagonists (product)}	& & & \\
	& 	8. Skin agent (product) 	& 	1 	& 	698 	& 	0.143\\

	& 	9. Gastrointestinal drug (product) 	& 	1 	& 	778 	& 	0.129\\

	& 	10. CNS drug (product) 	& 	1 	& 	1252 	& 	0.0799\\

	& 	11. Diagnostic aid (product) 	& 	1  	& 	1282 	& 	0.078\\

	& 	12. Anti-infective agent (product) 	& 	1 	& 	1785 	& 	0.056\\



\multicolumn{2}{l}{i. \textbf{Physical object}} \\

	& 	1. Device (physical object) 	& 	117 	& 	3758 	& 	3.113\\



\multicolumn{2}{l}{j. \textbf{Procedure}} \\

	& 	1. Regimes and therapies (regime/therapy) 	& 	75 	& 	1169 	& 	6.416\\

	& 	2. Administrative procedure (procedure) 	& 	47 	& 	1313 	& 	3.58\\

	& 	3. Procedure with a procedure focus (procedure) 	& 	42 	& 	1276 	& 	3.292\\

	& 	4. Procedure by intent (procedure) 	& 	50 	& 	2185 	& 	2.288\\

	& 	5. Procedure with a clinical finding focus (procedure) 	& 	18 	& 	1080 	& 	1.667\\

	& 	6. Procedure by method (procedure) 	& 	220 	& 	22223 	& 	0.99\\

	& 	7. Procedure by device (procedure) 	& 	38 	& 	4619 	& 	0.823\\

	& 	8. Laboratory procedure (procedure) 	& 	67 	& 	8677 	& 	0.772\\

	& 	9. Procedure by site (procedure) 	& 	61 	& 	10093 	& 	0.604\\



\multicolumn{2}{l}{k. \textbf{Qualifier value}} \\

	& 	1. Descriptor (qualifier value) 	& 	256 	& 	1579 	& 	16.213\\

	& 	2. Spatial and relational concepts (qualifier value) 	& 	59  	& 	1075 	& 	5.488\\

	& 	3. Intellectual concepts and systems (qualifier value) 	& 	29  	& 	660 	& 	4.394\\

	& 	4. Unit (qualifier value) 	& 	40 	& 	1137 	& 	3.518\\

	& 	5. Ranked categories (qualifier value) 	& 	30 	& 	964 	& 	3.112\\



\multicolumn{2}{l}{l. \textbf{Situation with explicit context}} \\

	& 	1. Procedure with explicit context (situation) 	& 	50  	& 	1113 	& 	4.492\\

	& 	2. Past history of (situation) 	& 	20 	& 	672 	& 	2.976\\

	& 	3. [V]Factors influencing health status and  	& 	33  	& 	1237 	& 	2.668\\

	& 	contact with health services (situation) 	& 		& 	 	& 	\\

	& 	4. Finding with explicit context (situation) 	& 	39 	& 	1813 	& 	2.151\\

	

\multicolumn{2}{l}{m. \textbf{Social context}} \\

	& 	1. Occupation (occupation) 	& 	82 	& 	4395 	& 	1.866\\



\multicolumn{2}{l}{n. \textbf{Staging and scales}} \\ 

	& 	1. Assessment scales (assessment scale) 	& 	52 	& 	884 	& 	5.882\\

 

\multicolumn{2}{l}{o. \textbf{Substance}} \\  

	& 	1. Dietary substance (substance) 	& 	26  	& 	2000 	& 	1.3\\

	& 	2. Drug or medicament (substance) 	& 	10  	& 	1669 	& 	0.599\\

	& 	3. Substance categorized functionally (substance) 	& 	10  	& 	1742 	& 	0.574\\

	& 	4. Materials (substance) 	& 	8 	& 	1420 	& 	0.563\\

	& 	5. Allergen class (substance) 	& 	8 	& 	1708 	& 	0.468\\

	& 	6. Substance categorized structurally (substance) 	& 	47  	& 	11871 	& 	0.396\\

	& 	7. Biological substance (substance) 	& 	6  	& 	2304 	& 	0.26\\
\hline

	\caption{\label{bigt}Coverage of SNOMED-CT categories with size greater than 500}
\end{longtable}


\end{center}

The coverage of second level categories can be divided into three groups: well covered, moderately covered and rarely covered. The coverage
values in Table \ref{bigt} vary widely,
from $16.213\%$ (k.1) down to $0.056\%$ (h.12).



Based on the above results, the boundary between the well and moderately covered categories can be placed, in relative terms, at a
coverage of around 1.0\%, and the boundary between the moderately and rarely covered categories at around 0.1\%. With these thresholds it is easy to
identify categories near both ends of the coverage ``spectrum''. The following items are examples of categories with notably high or low coverage. The
indicator in curly brackets points to the location of the item in Table \ref{bigt}:



\begin{enumerate}

  \item Well covered categories

  \begin{enumerate}

    \item Clinical history and observation findings (finding) \{b.3\}

    \item Administrative statuses (finding) \{b.2\}

    \item Finding by method (finding) \{b.4\}

    \item Clinical history/examination observable (observable entity) \{f.2\}

    \item Descriptor (qualifier value) \{k.1\}

    \item Unit (qualifier value) \{k.4\}

    \item Attribute (attribute) \{e.1\}

    \item Environment (environment) \{c.1\} 

  \end{enumerate}

  \item Rarely covered categories

  \begin{enumerate}

    \item Microorganism (organism) \{g.5\}

    \item Kingdom Animalia (organism) \{g.2\}

    \item Anti-infective agent (product) \{h.12\}

    \item Diagnostic aid (product) \{h.11\}

    \item CNS drug (product) \{h.10\} 

  \end{enumerate}

\end{enumerate}

Among the highlighted well covered categories, category b.3 has a relatively large size of 16319 concepts with a coverage of 4.283\%. Consistent with the
overview of the first level categories, the area mainly covered by the archetype repository is clinical finding. Also in line with the first level
overview, the results show that certain relatively small categories, such as k.1, k.4 and e.1,
tend to be well covered. Notably in the SNOMED-CT model, concepts under category e.1
\emph{Attribute} have a special use that allows users to compose refined new concepts. Complex
clinical statements can be made by using attribute concepts to link existing concepts. Known as
post-coordination \parencite{sno2008manual},
this mechanism relies on linkage concepts like \emph{Attribute} to refine an existing concept and expand it by adding other concepts. Category k.1
\emph{Descriptor (qualifier value)} is well covered and it also has a special role in the SNOMED-CT model. All second level categories under
\emph{Qualifier value} can be used as qualifiers in post-coordination to refine an existing concept. From these results, the archetype
repository appears to cover these categories relatively well.



Among the rarely covered categories, h.12, h.11 and h.10 are all product related. Category g.5 \emph{Microorganism} is very poorly
covered and this could be of interest to archetype modellers. Professionals in specific fields should be
able to find the results that relate to their area of interest. Modellers from both the archetype and SNOMED-CT communities can use these results to balance
future development.



\subsubsection{Frequency of term occurrence}



While extracting archetype terms from all the archetypes in the repository, manual observation showed that certain ambiguous terms exist, such
as `Location' and `Result'. After removing archetype terms in different versions and from inherited archetypes, such frequently used terms are still
widespread. Therefore the number of occurrences of each unique archetype term in this repository was recorded. As a result of this process, Table
\ref{freq} lists the most frequently used terms. These terms appear to play significant roles in general archetype semantics, but not in a medical
sense. In the author's view, they are analogous to ``stop words'' such as `and' and `of' in natural language processing, and the list can be used in the same way as a stop word list to separate terms of
this kind from less ambiguous terms.
In this experiment, however, these terms are
not excluded from the mapping process because it is important to know how well they are covered by SNOMED-CT.



\begin{table}\small
\begin{center}
\scalebox{0.9}{
\begin{tabular}{ l  l | l  l }

\hline

Archetype Term  &  Occurrence & Archetype Term & 

 Occurrence  \\

\hline

Tree	 & 	171 & Normal statements	 & 	27 \\

Any event	 & 	133 & None	 & 	27 \\

Event Series	 & 	113 & Findings	 & 	27 \\

no text for this at-code	 & 	77 & Name	 & 	25 \\

Comment	 & 	53 & Clinical Description	 & 	25 \\

Other	 & 	48 & Normal statement	 & 	24 \\

Description	 & 	43 & Right	 & 	23 \\

List	 & 	42 & Procedure outcome	 & 	23 \\

Comments	 & 	40 & Procedure comments	 & 	22 \\

Details	 & 	39 & Left	 & 	22 \\

History	 & 	32 & Person name performing procedure	 & 	20 \\

Normal	 & 	29 & & \\

\hline

\end{tabular}
}
\caption{Most frequently used free text terms}

\label{freq}
\end{center}
\end{table}








\subsection{Discussion} 
\label{discu}
This study aims to demonstrate the usefulness of terminological shadows by
estimating the coverage of clinical content of an archetype repository with
respect to SNOMED-CT. It attempts to measure the extent to which the
clinical content of SNOMED-CT is covered by existing archetypes. Coverage
has been measured against the top level SNOMED-CT
categories, specifically the first and second level categories. The scope is
limited to these levels because SNOMED-CT, considered the most
comprehensive clinical concept network, already has a wide range of categories
at the second level. In theory, however, this study can be extended to any
specific level of categories in SNOMED-CT. The intention of using second level
SNOMED-CT categories is to show a general clinical content coverage that is
agnostic to any specific medical field.


\subsubsection{First and second level category coverage}
% TODO -- replace the footnote with ref to sec about templates

The results of the first level category overview and second level category coverage give
researchers in both the archetype and SNOMED-CT communities an opportunity to identify areas of interest in a
number of ways. From an archetype developer's view, results such as those given in
Table~\ref{bigt} offer a mechanism to determine the status of an archetype repository. The terminological
shadow that is utilised in this application can provide a snapshot of how much clinical content of
SNOMED-CT has been covered within a particular repository. As development of an archetype repository
progresses, it can provide constant monitoring of the medical topics that are being modelled, helping to prevent
redundant information or unbalanced development. The technique was only applied to archetypes in
this case. As noted in section \ref{sec:archetypes} and section \ref{sec:cda},
CDA templates and openEHR templates were not widely
available when this study was completed, but the technique could easily be extended to these resources
when they become available in sufficient
numbers\footnote{openEHR specifications also include the idea of a template, which allows archetypes
to be combined and further constrained to suit particular scenarios that may arise in healthcare.
These templates also allow bindings to terminology systems.}. The approach presented here is
sufficiently flexible to be applied to any archetype repository, developed or growing.  The
threshold for determining when a category is well or rarely covered can be adjusted to particular
use cases. For this study the highest coverage of any category in Table \ref{bigt} is about 16\%, but
there is as yet no way to determine a satisfactory threshold for a SNOMED-CT category.  For example,
a national project's archetype repository should arguably have higher coverage than the repository of a smaller
project. In
any case, what counts as an acceptable level of coverage for a whole repository is also still
undetermined.



For SNOMED-CT modellers, this approach provides feedback from the application of SNOMED-CT concepts
in archetype modelling. It reveals which areas of medical concepts are the focus of modelling in
a well known EHR model. The popular categories reflect the overlapping areas where the two communities
have a common focus, while the under-covered categories may indicate differences in modelling
priorities. Guidance on SNOMED-CT concept usage can be provided where coverage is not satisfactory.
\emph{Microorganism}, for example, exhibits much lower coverage than other categories; further
investigation may reveal why so few archetypes use concepts from this large category. Medical
professionals in specific areas can also pay attention to their own fields. For example, \emph{Kingdom
Animalia} is a reasonably large category but is barely covered in this repository. However,
practitioners of veterinary medicine may require archetypes that use many
concepts from this category.



Last but not least, the results of this study are a resource for archetype creation planning. For
instance, there are 11413 types of \emph{Microorganism} in this version of SNOMED-CT. If archetype
modellers decide to create archetypes and templates for microorganisms, the results show that there
is likely to be a requirement for either a query based binding mechanism or else a relatively large
quantity of bindings even to achieve a coverage of 1\% of the total number of concepts in this
category. A similar analysis can be performed on many of the first level and second level categories
with similar results. Archetype authors may need to incorporate specific queries, large hierarchical
archetypes or even more sophisticated mechanisms to make these concepts available.  Assuming that
archetype modellers' goal is to cover as much clinical content as possible to be used with EHRs, the
results of the coverage of current archetypes can indicate the likely complexity of future
development and implementation.
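
To make the scale concrete, and assuming that coverage is the ratio of uniquely mapped concepts to category size (an assumption that is consistent with the values in Table \ref{bigt}), a coverage of 1\% of \emph{Microorganism} would require roughly
\[
0.01 \times 11413 \approx 114
\]
uniquely mapped concepts, compared with the 9 concepts currently mapped for this category (item g.5 in Table \ref{bigt}).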



Based on the observation of the above coverage results, the areas identified as well and rarely
covered also lead to discussion and comparison of the two health modelling patterns. The Archetype
Object Model is record-oriented, which means its model resembles the structure of document records,
whereas SNOMED-CT is ontology-oriented, meaning that it is a network of medical phenomena based on
logic. However, similarities can be found between the two modelling approaches because the patterns
of organising information show some correspondence at certain levels.



The fact that categories such as \emph{Clinical finding} are well covered in the studied repository
suggests that the archetypes in the repository mainly focus on clinical finding
information. The results also show that the proportion of archetype terms mapped to the
\emph{Qualifier value} category is considerably higher than for other categories despite its
small size. \emph{Attribute} has similarly been well covered despite its relatively small size. One
possible reason is that many parts of archetypes use these concepts
frequently, as qualifying answers such as ``Mild'' and ``Nil significant'' or as attributes like
``Severity''. Figure \ref{dia} gives one example from the
\emph{openEHR-EHR-CLUSTER.symptom-pain.v1} archetype \parencite{pain} to show how these concepts are
linked to archetypes.
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=.8\textwidth]{../res/Dia}
\end{center}
\caption{Example of heavily used concepts}
\label{dia}
\end{figure}
The category distribution
implies, to some extent, that some openEHR reference model classes are related to certain
clinical concept domains. The next section discusses this phenomenon in more detail.
%For instance the 


\subsubsection{Reference model classes versus SNOMED-CT categories}
\label{sec:snorm-class}
The results also show an interesting relationship between the reference model classes of the archetype
terms and the categories of the mapped SNOMED-CT concepts. The pie chart in Figure
\ref{pie_codephrase} shows the
distribution of SNOMED-CT categories among all archetype terms that are of reference model class
``CODEPHRASE''.
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=.7\textwidth]{pie_codephrase}
\end{center}
\caption{The distribution of SNOMED-CT categories of reference model type: CODEPHRASE}
\label{pie_codephrase}
\end{figure}
The areas in grey might be of interest to readers because these SNOMED-CT categories seem
to be largely associated with the given reference model type. In Figure \ref{pie_codephrase}, the largest
grey areas are the SNOMED-CT categories \emph{Finding} and \emph{Qualifier
value}, which indicates a close link between
these categories and the reference model class ``CODEPHRASE''. It could be hypothesised that the
reference model class ``CODEPHRASE'' is closely aligned with the SNOMED-CT category \emph{Qualifier
value} because they are used to present similar semantic content in healthcare. A code phrase can be
used to specify the value of a property in a complex clinical statement,
while a qualifier value in SNOMED-CT specifies a characteristic of a medical concept.
The break-downs of categories for other major reference model classes, namely CLUSTER, ACTION, ELEMENT and
OBSERVATION, are shown in the following pie charts.

% todo -- check if done Refer to this section in the remark.
As with ``CODEPHRASE'', each category break-down shows one or more
dominant SNOMED-CT categories that seem to be associated with the reference model class. The break-downs for
``CLUSTER'' and ``ACTION'' are shown in Figure \ref{pie_cluster} and Figure \ref{pie_act}. The class ``CLUSTER'' is most strongly associated with the SNOMED-CT category
\emph{Finding}, but also has comparable associations with the SNOMED-CT categories
\emph{Procedure}, \emph{Observable entity} and \emph{Qualifier value}.

\begin{figure}[!htbp]
    \centering
    \includegraphics[width=.7\textwidth]{pie_cluster}
    \caption{SNOMED-CT category break-down of reference model class CLUSTER}
    \label{pie_cluster}
\end{figure}

\begin{figure}[!htbp]
    \centering
    \includegraphics[width=.7\textwidth]{../res/pie_act}
    \caption{SNOMED-CT category break-down of reference model class ACTION}
    \label{pie_act}
\end{figure}

These results lead to several observations. It could be hypothesised that the
reference model class ``CLUSTER'' is a general container that does not specify much clinical
context. It could also imply that the ``CLUSTER'' class should be used with SNOMED-CT concepts that
are more general. It could also be inferred from the pie chart that the ``CLUSTER'' archetypes in
the NHS repository are mainly about clinical findings, procedures and certain clinical
characteristics such as pain symptoms. On the other hand, the reference model
class ``ACTION'' can be seen from Figure \ref{pie_act} to be mostly linked to \emph{Procedure}. This
suggests that the ``ACTION'' entry class in the openEHR information model is relatively more domain
specific. The association between ``ACTION'' and procedures in SNOMED-CT could imply that the
EHR information model of clinical actions is relatively well aligned with the SNOMED-CT concept model.
This could be helpful for future archetype developers who are creating archetypes for
clinical actions and for developers who design algorithms to find SNOMED-CT concepts for action
archetypes.

Similar observations can be made from Figure \ref{pie_observation} and Figure \ref{pie_element}.
The ``ELEMENT'' class is evenly
dominated by \emph{Finding}, \emph{Qualifier value} and \emph{Observable entity}.
Meanwhile, the
``OBSERVATION'' class is largely related to \emph{Finding}, \emph{Procedure} and \emph{Observable
entity}.  Therefore the ``ELEMENT'' class seems to be generic and the ``OBSERVATION'' class has
a certain alignment with clinical findings in SNOMED-CT.
A full table of reference model classes versus SNOMED-CT categories, detailing more classes,
can be found on the download page of the \emph{EHRland} project website.
\begin{figure}[!htbp]
    \centering
    \includegraphics[width=.7\textwidth]{pie_element}
    \caption{SNOMED-CT category break-down of reference model class ELEMENT}
    \label{pie_element}
\end{figure}



\begin{figure}[!htbp]
    \centering
    \includegraphics[width=.7\textwidth]{pie_observation}
    \caption{SNOMED-CT category break-down of reference model class OBSERVATION}
    \label{pie_observation}
\end{figure}



The main justification for using the top two levels of SNOMED-CT categories is to
obtain an unbiased view of clinical content coverage. Second level SNOMED-CT
concepts in particular are good candidates to represent general clinical
knowledge. The archetype repository that has been chosen is a general purpose
repository which may contain clinical information from all aspects of medical
care.


The reference model classes are derived from clinical practice. They do
not necessarily correspond to the clinical knowledge that is modelled by
formal ontologies such as SNOMED-CT. Depending on their purpose,
some controlled vocabularies, for example the Gene Ontology, do not have concepts that
are related to EHR information structures. In order
to make the terminological shadow approach generally applicable, this study
does not assume that the reference model classes have corresponding concepts in the
terminologies. Clearly, the use of reference model classes alongside
terminology-based considerations would help to improve the effectiveness
of the shadow creation process. There are certainly reusable correspondences
between concepts in CDA, openEHR and EN13606 which would allow a limited form
of commonality in the use of classes. However, there are other clinical systems that
are built with reference models designed for different clinical purposes.




\clearpage




% ----------------------------------------
% paper 3 stuff
% ----------------------------------------


\section{Application 2: Archetype comparison by terminological shadow}
\label{sec:paper3}
The applicability of the terminological shadow approach is not limited to research.
Section \ref{sec:paper2} has demonstrated that terminological shadows can be utilised to discover the clinical coverage of an archetype
repository. Shadows can also be used in applications that allow users to search, find and combine
information in the archetypes. The rapid growth of archetypes in a number of international
repositories mentioned in section \ref{sec:repo} makes
archetype management a difficult problem. Requirements to search and access archetypes based on
semantic content are increasing in the health informatics community, yet there is a lack of tools to search
for information based on the clinical content of the archetypes. If a user needs to look for an
archetype that contains a specific medical concept, there are few alternatives to relying on
basic string matching searches and manually opening archetypes in an archetype editor for
examination. This section focuses on a particular problem in archetype management, namely that
archetypes with closely related clinical content may be developed in different repositories. Despite
different structures and descriptions, certain archetypes may have very closely related
semantic meanings. Identifying these archetypes would help to avoid duplicated modelling effort.
The terminological shadow approach could help by providing an archetype
comparison function that measures the similarity between two archetypes. This section provides the
implementation detail of
such an application. By
describing the methodology and implementation, it attempts to exemplify the applicability of the
terminological shadow approach.



\subsection{Motivation}
Archetypes are frequently developed by health informaticians who intend to cover certain
medical scenarios. In the current archetype development and review model \parencite{alberto2011mie}, which allows
archetypes to be freely created but reviewed by experienced clinical experts,
it seems inevitable that overlapping clinical content will appear in archetypes. As emphasised by
the specifications from the openEHR foundation, archetypes should be created as highly
reusable clinical content metadata. Archetypes
with redundant information, although sometimes necessary to fit clinical requirements,
can usually be merged or improved. Redundant information that repeatedly appears in different
archetypes should probably be modelled as a single reusable archetype.

This type of redundancy in archetypes, which can be difficult to detect without human observation,
is likely to arise when different modellers develop archetypes for similar clinical purposes using different
structures and descriptions. Although the archetype review process acts as a filter to
eliminate redundancy and advocate good modelling principles, it is relatively difficult to
prevent redundant clinical content from being created when new archetypes are developed. Archetype
modellers cannot be expected to be aware of all the clinical content that is covered by or
overlaps with existing archetypes in one or multiple repositories. To minimise the number of duplicate
archetypes that contain clinical content covered by existing archetypes, an automated process
should be introduced to best `guess' the likeness of archetypes. Terminological shadows can be
adopted to produce a comparison function to measure the similarity of two archetypes.

The aim of creating an archetype comparison application is to demonstrate that
the terminological shadow approach is generally applicable in developing tools
that compare the semantic content of archetypes. This study aims to answer the
need of the archetype community to search and compare the semantic content of
archetypes. Terminological shadows are a good starting point for building semantic
tools that enable archetype developers and users to navigate and compare
archetypes easily. Such functionality requires a fundamental mechanism to
measure the clinical content of archetypes in a semantic way. Tools built on
top of terminological shadows can exploit the SNOMED-CT concepts and compare
them in the SNOMED-CT network. This study gives one such example by applying
graph based techniques to measure archetype similarity.


Additionally, detecting the similarity between archetypes can facilitate the harmonisation of archetype
development across different reference models. By identifying similar archetypes that
use different EHR reference models, good design patterns can be borrowed and considered to
improve clinical content modelling. In the future, this application could potentially extend
its comparison function to support queries that find and search archetypes based on
SNOMED-CT concepts.

\subsection{Related work}
The method of comparing archetypes with terminological shadows takes advantage of the structure of
the SNOMED-CT concept network as a directed acyclic graph, which has been introduced in the
literature review. Before discussing the
technique for archetype comparison that is used in this application,
a number of concepts need to be introduced. The development and
adoption of biomedical ontologies, which form the central pillars of modern clinical classifications
and terminologies, has greatly improved the quality of clinical information. The clinical concepts
and entities described by the ontologies provide a means of measuring how closely related clinical phenomena
are, which is referred to as \emph{Semantic Similarity} \parencite{rada1989development}. One example is the measurement of
similarity among the results of biomedical experiments such as gene expression analysis. Similarity
measurement is a commonly required task in graph-based ontologies. Several methods can be used for
measuring the similarity of clinical concepts in ontologies. The following subsections discuss the
strategy that is used in this application.

\subsubsection{Semantic similarity}
Similarity measurement is not a new idea. Determining the relatedness of two entities by using ontologies has been a mainstream
research topic, with
applications across many fields including information retrieval, personalised recommendation, gene
expression similarity and so
on. These applications typically rely on dedicated ontologies to quantify the
relatedness of entities such as medical concepts, words, products and documents. The word
``semantic'' in \emph{Semantic Similarity} denotes the domain knowledge that is associated with the
ability to intelligently search, suggest and compare entities. Semantic similarity measurement is of
particular interest to bioinformaticians in the field of genomics, who use it to match and compare gene products
and functions. The Gene Ontology \parencite{ashburner2000gene}, as a primary controlled vocabulary to describe and annotate
genes, is an essential knowledge base for measuring the similarity of gene products for various research purposes.

Many strategies can be applied to measure similarity in an ontology. For instance, a widely used
ontological source for words and phrases of natural languages is WordNet. One straightforward
measurement of the semantic similarity of two concepts is their semantic distance in the
ontology. The closer two words appear to each other in WordNet, the closer their meanings
are semantically \parencite{Agirre95aproposal}. It has been proposed that an important metric of similarity is the
minimum distance between the concepts, which is quantified by the number of edges that link the two concepts \parencite{rada1989development}.
The measure of the distance between concepts \emph{A} and \emph{B} can be represented as:
\[
\mathit{Distance}(A, B) = \text{minimum number of edges separating } A \text{ and } B
\]
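
As a small worked illustration, consider a hypothetical hierarchy fragment (assumed here purely for illustration, not taken from WordNet or SNOMED-CT) in which the concepts \emph{car} and \emph{truck} are each linked by a single edge to \emph{vehicle}. The shortest path between \emph{car} and \emph{truck} passes through \emph{vehicle}, so that
\[
\mathit{Distance}(\text{car}, \text{vehicle}) = 1, \qquad \mathit{Distance}(\text{car}, \text{truck}) = 2 .
\]
Under this measure \emph{car} is therefore considered closer in meaning to \emph{vehicle} than to \emph{truck}.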


As ontologies mostly appear as graphs or networks, with semantic relationships connecting the concepts or entities, the
strategies for measuring semantic similarity are directly associated with graph theory. Concepts in
an ontology are considered as nodes in a graph or network. For example, there exist many approaches
to rate the relatedness of two genes that are described with terms from the Gene Ontology.
Variations of the semantic distance, or measurements based on information content, can be used to
quantify similarity.

Many clinical ontologies, including SNOMED-CT, are developed as hierarchical graphs. The most common
relationship between concepts in many clinical ontologies is
``IS-A'' \parencite{smith2005rel}. ``IS-A'' links two nodes in a concept network such that the first concept can be classified
as a specialised case of the second concept. For example, a car ``IS-A'' vehicle.
SNOMED-CT has been created as a Directed Acyclic Graph, in
which the nodes are connected via the ``IS-A'' relationship and following the ``IS-A'' links from a node
never leads back to that node. Research in
developing semantic similarity measurement of SNOMED-CT concepts may improve search, data mining,
and knowledge discovery in databases that store clinical data or medical literature. In this
application,
the measurement of semantic similarity in the SNOMED-CT ontology is used to provide the basis for
archetype comparison.



\subsubsection{Similarity measures}
It was noted in the previous section that there are many approaches to measure the semantic similarity of two
entities in a biomedical ontology. It was also noted that some approaches are developed from graph-based algorithms
or techniques in the field of information retrieval \parencite{rada1991document}.
% todo -- check all the utf8 char!!!
Pesquita et al.\ have summarised the common approaches for computing semantic
similarity \parencite{pesquita2009semantic}. The computational
methods that attempt to measure semantic similarity of two medical concepts 
in biomedical ontologies can be broadly categorised as \emph{edge-based} and
\emph{node-based}\footnote{The article also includes similarity measures between groups of concepts.}.
The approaches classified as edge-based are mainly based on the measurement of the distance
between two concepts, such as the number of edges in the graph that separate them.
%Lee et al., 1993; Rada and Bicknell, 1989; Rada et al., 1989
Edge-based approaches also include measuring the shared path of the two concepts and
converting the number of edges into a similarity metric. 
It is commonly considered that edge-based approaches are more intuitive and easier to use for 
measuring the similarity of two nodes in an ontology.
The node-based approach, however, relies on a
concept introduced by \textcite{Resnik1995ss, Resnik92wordnetand} % resnik 1995
called \emph{Information Content} (IC). 
The information content of a concept in an ontology is a measure that incorporates
the probability of the occurrence of this concept in a given corpus, which may be a collection of
medical documents, clinical records, gene annotation and so on. 
The computation of the IC of a concept $c$ is shown as follows:
\[
IC(c) = -\log p(c)
\]
where $p(c)$ represents the probability of concept $c$ appearing in the corpus. 
IC is a metric that indicates how informative a concept is, with the assumption that
less frequently occurring concepts convey more information. The IC measure has been commonly used in node-based 
approaches to address the limitation of edge-based methods, which assume that every edge represents the same semantic
distance \parencite{pesquita2009semantic}. The node-based approaches, which utilise the IC to assess the similarity between
concepts, are considered a balanced measure that is less sensitive to differences of concept density in an
ontology and to the variable semantic distance per edge. However, the IC relies on the occurrence of
concepts in a corpus, which may be biased because certain concepts appear more frequently in a
corpus than others.
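
As a small worked illustration of the IC definition, consider a hypothetical corpus (the counts below are
assumptions chosen for illustration, not measurements) containing 10\,000 concept occurrences, in which a general concept $g$ occurs
1\,000 times and a specific concept $s$ occurs 10 times. The base of the logarithm is not fixed by the definition;
base 10 is assumed here:
\[
IC(g) = -\log_{10}\frac{1000}{10000} = 1, \qquad IC(s) = -\log_{10}\frac{10}{10000} = 3
\]
The rarer concept $s$ receives the higher information content, which reflects the assumption that less
frequent concepts are more informative.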


%lowest comm.. edge-based similarity measurement strategy suitability of LCA..
\subsection{Comparing archetypes}
Archetypes are the artefacts produced through a process of clinical information modelling, and they
contain clinical semantic content that is defined by clinical experts. 
The author believes that terminological shadows can improve the management of the semantic
content in archetypes by providing
better functionality for searching, finding, comparing and navigating archetypes. By analogy
with \emph{Semantic Web} technology, which aims to better organise semantic content on the
web by incorporating ontologies into web searches, terminological shadows utilise SNOMED-CT as
an ontology to facilitate better knowledge management for archetypes. 
The archetype comparison function extends the terminological shadow approach to
demonstrate the applicability of shadows in archetype management. 

There are potentially a number of ways to measure the similarity between archetypes. As mentioned in
the previous section, there exist two broad categories of similarity measurement strategies:
\emph{edge-based} and \emph{node-based} approaches. Evaluating all available semantic similarity measures is not the objective of this thesis.
An edge-based approach is chosen here to build an archetype comparison function
because of its simplicity. Node-based approaches, on the
other hand, rely on the probabilities of concepts occurring in a reasonably large
collection of samples. The archetypes that have been gathered so far seem to lack variety and
quantity, which may lead to a biased result based on \emph{Information Content}. 
Although the number of archetypes is growing in many repositories, it may still take
time for them to become a comprehensive and large collection. Future work will evaluate whether
the available archetypes are adequate for a node-based semantic measure. 



In this application the comparison of two
archetypes is achieved by creating terminological shadows of the archetypes and using an
edge-based similarity measure called \emph{lowest common ancestor} (LCA) to generate a similarity score.
Figure \ref{similar} outlines the basic steps of the archetype comparison function.
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=\textwidth]{../res/similar_proc}
\end{center}
\caption{An overview of the archetype comparison function}
\label{similar}
\end{figure}
It shows that the terminological shadow framework can be extended to incorporate a similarity
measure such as LCA to generate a similarity score for two archetypes. The detail of the method will
be discussed in the following sub-sections.


\subsubsection{Lowest common ancestor}
The terminological shadows created by the framework contain SNOMED-CT concepts, which are considered
the semantic equivalent of the clinical concepts in archetypes. These concepts, existing as nodes
in the SNOMED-CT hierarchical graph, are compared for their likeness. The similarity
measurement is mainly based on the computation of the lowest common ancestor of two concepts. A
``likeness'' score can be obtained between the two SNOMED-CT concepts.
A final score can be derived from all the ``likeness'' scores to indicate the relatedness of two
archetypes.

The lowest common ancestor (LCA) is an important concept in the field of graph theory.
The LCA of two nodes $a$ and $b$ in a rooted tree denotes the lowest node that
subsumes both $a$ and $b$. The LCA identifies the closest shared ancestor of the two nodes in the hierarchical graph. 
Figure \ref{LCA} shows an example of LCA, where the black node is the lowest common ancestor of concept A
and B.
\begin{figure}[!htbp]
\begin{center}
\includegraphics[scale=.7]{../res/LCA}
\end{center}
\caption{An example of the LCA of two concepts in SNOMED-CT}
\label{LCA}
\end{figure}
The LCA appears in a number of similarity measurement equations that aim to
quantify the relatedness of entities in ontologies. The LCA is easy to obtain in a
directed acyclic graph (DAG) by following the ``IS-A'' links. In a DAG every edge is
directed and no sequence of edges leads back to its starting node. In SNOMED-CT,
each concept other than the root points to its parent concepts through ``IS-A'' links, so the
ancestors of a concept can be enumerated by following these links. The technique that is used in this
application to compute the similarity score is mainly based on LCA, which will be discussed in the next
sub-section.

\subsubsection{Compare archetypes via terminological shadows}
It has been demonstrated that archetypes can be transformed into
terminological shadows, which contain SNOMED-CT concepts as the equivalent of archetype terms.
The comparison of archetypes can be achieved by computing the similarity of SNOMED-CT concepts in
the shadows. The comparison method comprises two steps: 
\begin{enumerate}[(a)]
  \item generating terminological shadows for the archetypes to be compared, 
  \item calculating a similarity score for each pair of SNOMED-CT concepts in the shadows and
obtaining a final score of relatedness.
\end{enumerate}

To initiate the comparison of two archetypes, their terminological shadows will first be created
by using the framework. Then the SNOMED-CT concepts in two shadows will be compared
individually in pairs to obtain a one-to-one similarity score. The process for pairing up the concepts
will be discussed in section \ref{sec:expsetting}. The similarity measure adopted in this study is an edge-based approach
first proposed by \textcite{coling02edge}. 
% REF Taxonomy learning - factoring the structure of a taxonomy into a
% semantic classification decision
The rationale for using this method is consistent with the purpose of
the study, which is to demonstrate the ability of terminological shadows to semantically manage 
the clinical content in archetypes. Improving the efficiency of similarity measures is an on-going
research topic. At the time of writing, various new and established measures are undergoing 
rigorous evaluation and investigation. This study intends to present a generic approach that
is based on terminological shadows and is not tied to a specific similarity measure.
The similarity score of two SNOMED-CT concepts $c_1$ and $c_2$ is given by
Equation \ref{eq:sim}:
\begin{equation} \label{eq:sim}
sim(c_1,c_2)=\frac{\delta(c_a,root)}{\delta(c_a,root)+\delta(c_1,c_a)+\delta(c_2,c_a)}
\end{equation}
	\myequations{Equation number \ref{eq:sim}}
where $c_a$ is the lowest common ancestor of $c_1$ and $c_2$ in SNOMED-CT,
$\delta(c_a,root)$
denotes the number of edges between the LCA and the root node, and $\delta(c_1,c_a)$ and
$\delta(c_2,c_a)$ denote the number of edges between each concept and the LCA. The distances that are used in the equation
have been visualised in Figure \ref{similarity} where the edges are marked out. 
\begin{figure}[!htbp]
\begin{center}
\includegraphics[scale=.7]{../res/similarity}
\end{center}
\caption{The edges used in the calculation of the similarity score based on the lowest common ancestor}
\label{similarity}
\end{figure}
The example in the figure shows that the edge count from the LCA of $c_1$ and $c_2$ to the root node
is three and the number of edges between the two concepts is three.
Figure \ref{similarity} illustrates how the similarity score is obtained with knowledge of the LCA
of two concepts in the SNOMED-CT ontology. There are many methods to determine the LCA for two nodes
in a graph. The method of finding the LCA used in this application is
explained in the experiment setting section \ref{sec:expsetting}.
The similarity score will be calculated for
all pairs of concepts from the terminological shadows and used to present an overall relatedness
score for the two archetypes. 
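
As a worked illustration of Equation \ref{eq:sim}, the edge counts of the example in Figure \ref{similarity} can be
substituted directly. It is assumed here that the three edges between the two concepts are counted along the path
through the LCA, so that $\delta(c_1,c_a)+\delta(c_2,c_a)=3$, and that $\delta(c_a,root)=3$:
\[
sim(c_1,c_2)=\frac{3}{3+3}=0.5
\]
Under this reading, a deeper LCA or a shorter path between the two concepts moves the score towards 1, whereas an LCA
close to the root moves it towards 0.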



\subsubsection{Experiment setting for archetype comparison}
\label{sec:expsetting}
The experiment carried out in this application is to compare two pairs of archetypes using the method that
has been introduced to determine the closeness of archetypes. For the purpose of this demonstration,
three archetypes have been selected for comparison: \emph{body site},
\emph{anatomical location precise} and \emph{blood pressure}. The full names of these archetypes can
be found in Table \ref{arch_paper3}.
\begin{table}[!htbp]\small \begin{center} \begin{tabular}{ |r|l| }

  \hline
  Archetype full name & No. \\
  \hline
openEHR-EHR-CLUSTER.anatomical\_location-precise.v1.adl & 1\\
openEHR-EHR-CLUSTER.body\_site.v2.adl &    2\\
openEHR-EHR-OBSERVATION.blood\_pressure.v1.adl &  3\\
			 \hline
 		 \end{tabular} \end{center} 
		 \caption{Selected archetypes for comparison using LCA based similarity measure} 
		 \label{arch_paper3}
\end{table}
The archetypes were selected heuristically by observation. The \emph{body site} and
\emph{anatomical location precise} archetypes are believed to have been created to cover very
similar clinical content in an EHR. The comparison between these two archetypes should reveal these
similarities. A third archetype, \emph{blood pressure}, which is regarded as quite a different
archetype with a specific purpose, is compared with \emph{body site} to test the
comparison approach. 

%All archetypes are obtained from the openEHR archetype repository and listed in
%Appendix [ref]. % All archetypes used in the thesis

The terminological shadows of the archetypes listed in Table \ref{arch_paper3} are generated by the
shadow creation framework. The SNOMED-CT concepts in the shadows, as the outcome of the creation,
are paired up and used in the comparison process. In this application, the SNOMED-CT concepts that are
paired and compared are those of the leaf nodes in the archetype node tree, grouped by their reference
model class names. Figure \ref{pairup} illustrates how the nodes are paired and compared. 
\begin{figure}[!htbp]
\begin{center}
\includegraphics[scale=.7]{../res/pairup}
\end{center}
\caption{SNOMED-CT concepts in the shadows are paired according to reference model class types}
\label{pairup}
\end{figure}
In the diagram each node that is of a particular shape represents an archetype node with its
corresponding SNOMED-CT concept in the terminological
shadow. The shapes indicate different reference model classes. As described earlier, the leaf nodes
that are of the same type of reference model class are compared. Each leaf node is compared to
all the leaf nodes that are of the same type in the other archetype. This decision has been influenced
by the observation that leaf nodes in archetypes tend to contain specific medical concepts.
Once the SNOMED-CT concept-to-concept pairs have been set up, their LCAs are computed. 
There are many algorithms for calculating the LCA for two nodes in a graph. The method used in this
application is to obtain every possible path from each of the two SNOMED-CT concepts to the root node. All paths
are compared to find the longest common path, which contains the lowest concept in SNOMED-CT that
subsumes both concepts; this concept is the LCA of the two
concepts. 
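
The path-based procedure can be sketched more formally as follows. The notation is introduced only for this
illustration and is an assumption: $\mathcal{P}(c)$ denotes the set of ``IS-A'' paths from concept $c$ to the root,
$A(c)$ the set of concepts lying on any of these paths, and $\mathit{depth}(v)$ the number of edges on the
common path from $v$ to the root.
\[
A(c) = \bigcup_{\pi \in \mathcal{P}(c)} \{\, v : v \text{ lies on } \pi \,\}, \qquad
c_a = \arg\max_{v \in A(c_1) \cap A(c_2)} \mathit{depth}(v)
\]
Because a SNOMED-CT concept may have more than one parent, several shared ancestors may be equally deep;
how such ties are resolved is not specified here.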

%The detail of the algorithm can be found in Appendix [ref].

The similarity score is calculated according to Equation \ref{eq:sim}. 
%A final score for the whole
%archetype is computed by obtaining an arithmetic mean of all the scores. % verify
The three selected archetypes form two comparisons: \emph{body site} against \emph{anatomical location
precise} and \emph{body site} against \emph{blood pressure}. The results of the comparisons are
presented in the next section.
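
When two leaf nodes each carry several coded values, as the CODEPHRASE nodes do, the pairwise scores have to be
combined into a single value. The aggregation rule is not stated in this section; one simple option, given
purely as an assumption and not as a description of the implemented method, is the arithmetic mean over all concept pairs,
where $S_1$ and $S_2$ are the hypothetical sets of SNOMED-CT concepts mapped to the two nodes:
\[
\mathit{score}(S_1,S_2) = \frac{1}{|S_1|\,|S_2|}\sum_{c_1 \in S_1}\sum_{c_2 \in S_2} sim(c_1,c_2)
\]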




\subsection{Results}
The results obtained from the comparisons demonstrate that the similarity between archetypes
can be measured by their terminological shadows, which also reveal, semantically, the relationships between the
clinical content in two different archetypes. Table \ref{paper3_result1} shows the
similarity scores obtained when comparing the archetype \emph{anatomical location precise} with
\emph{body site}.  % remember to add the final score
The first and second columns of the table give the descriptions of the archetype terms for the two archetype nodes being compared. 
% need this appendix?
%The SNOMED-CT concepts that have been mapped to each node in both archetypes can be found in Appendix [ref]. 
The third column indicates the reference model class type of the
leaf node. The last column is the similarity score that is calculated using Equation \ref{eq:sim}.
\begin{center}
\footnotesize
\begin{longtable}[!htbp]{l | l | l | l}
  

  \hline
  \textbf{Anatomical location precise} & \textbf{Body site} & \textbf{RM class} & \textbf{Score}\\
  \hline
Precise anatomical location  & Body site & CLUSTER & 0.166666666667 \\
Name of location& Body site & ELEMENT & 0.142857142857 \\
Name of location& Laterality & ELEMENT & 0.111111111111 \\
Name of location& Body site description & ELEMENT & 0.111111111111\\
Side& Body site & ELEMENT & 0.142857142857 \\
Side& Laterality& ELEMENT & 0.111111111111 \\ 
Side& Body site description& ELEMENT & 0.111111111111 \\
\hline
Left,Right,Bilateral & Left,Right,Bilateral,Unilateral  & CodePhrase & \textbf{0.47996031746}\\
& left,Unilateral right,Unilateral & & \\
\hline  

Numerical identifier& Body site & ELEMENT & 0.166666666667 \\
Numerical identifier& Laterality & ELEMENT & 0.125 \\
Numerical identifier& Body site description & ELEMENT & 0.125 \\
\hline
First,Second,... & Left,Right,Bilateral,Unilateral & CodePhrase & \textbf{0.321836419753} \\
Seventeenth,Eighteenth & left,Unilateral right,Unilateral & & \\
\hline
Anatomical plane& Body site & ELEMENT & 0.25 \\
Anatomical plane& Laterality & ELEMENT & 0.0909090909091 \\
Anatomical plane& Body site description & ELEMENT & 0.0909090909091 \\ 
\hline
Midline, & Left,Right,Bilateral,Unilateral & CodePhrase & 0.169841269841 \\
Midclavicular line, & left,Unilateral right,Unilateral & & \\
Midaxillary line, & & & \\
Midscapular line  & & & \\
\hline

Identified landmark& Body site & ELEMENT & 0.333333333333 \\
Identified landmark& Laterality & ELEMENT & 0.111111111111 \\
Identified landmark& Body site description & ELEMENT & 0.111111111111 \\ 
Aspect& Body site & ELEMENT & 0.181818181818 \\
Aspect& Laterality & ELEMENT & 0.0714285714286 \\
Aspect& Body site description & ELEMENT & 0.0714285714286 \\
\hline
Above,Below,Medial to, & Left,Right,Bilateral,Unilateral & CodePhrase & \textbf{0.419266088408}\\
Lateral to,Superior to,& left,Unilateral right,Unilateral & & \\
Inferior to,Anterior to, & & & \\
Posterior to,Inferolateral to...  & & & \\
\hline
Distance from landmark& Body site & ELEMENT & 0.333333333333 \\
Distance from landmark& Laterality & ELEMENT & 0.111111111111 \\
Distance from landmark& Body site description & ELEMENT & 0.111111111111 \\
Position frame of reference& Body site & ELEMENT & 0.111111111111\\
Position frame of reference& Laterality & ELEMENT & 0.0909090909091\\
Position frame of reference& Body site description & ELEMENT & 0.0909090909091\\
X offset& Body site & ELEMENT & 0.1 \\
X offset& Laterality& ELEMENT & 0.0833333333333\\
X offset& Body site description& ELEMENT & 0.0833333333333 \\
Visual markings/orientation& Body site& ELEMENT & 0.142857142857 \\
Visual markings/orientation& Laterality& ELEMENT & 0.111111111111 \\
Visual markings/orientation& Body site description & ELEMENT &  0.111111111111\\
Description &  Body site& ELEMENT & 0.142857142857 \\
Description &  Laterality& ELEMENT & 0.111111111111\\
Description &  Body site description& ELEMENT & \textbf{1.0}\\
Image& Body site& ELEMENT & 0.1 \\
Image& Laterality& ELEMENT & 0.0833333333333\\
Image& Body site description& ELEMENT & 0.0833333333333\\

\hline
 		
	 \caption{\label{paper3_result1}Similarity scores of pairs of SNOMED-CT
			 concepts from \emph{anatomical location
			 precise} and \emph{body site} terminological shadow comparison}
\end{longtable}
\end{center}



Table \ref{paper3_result1} reveals how closely related these two archetypes are by the measure
of semantic similarity in SNOMED-CT. It can be observed from the similarity scores that the most
related pair of CODEPHRASE nodes is \texttt{Left, Right, Bilateral} from \emph{anatomical location 
precise} and \texttt{Left, Right, Bilateral, Unilateral left, Unilateral right, Unilateral} from
\emph{body site}, which achieved the highest CODEPHRASE score in the result, $0.47996031746$. 
Both nodes share a common clinical concept, which is the `laterality of the location'. 
Although it is quite obvious from a human perspective that these two CODEPHRASE nodes are semantically close to
each other, it may not always be easy for computer systems to discover that these two entities are closely
related. The second highest CODEPHRASE score, which has been achieved by
nodes \texttt{Above, Below, Medial to, Lateral to\ldots} and \texttt{Left, Right, Bilateral, Unilateral
left, Unilateral right, Unilateral}, shows that they are not as close as the previous pair. The fact
that both these archetype nodes represent locational concepts contributes to the similarity score,
but the SNOMED-CT concepts in shadows can be used to distinguish the semantic difference. 
These similarity scores demonstrate the ability of terminological shadows to serve as an instrument for
measuring semantic similarity and differentiating clinical archetypes. 

The highest similarity score produced in the result is
$1.0$ for \texttt{Description} and \texttt{Body site
description}. It is caused by the terminological shadow creation process, which generates the same
SNOMED-CT concept for both nodes, so the semantic similarity becomes 1. 
However, there are reasons to believe that these archetype node descriptions are too general and similar, which led to an
inflated score. Future improvement of the method will address this issue by adding steps to verify
the equivalence of the two descriptions.
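
The score of $1.0$ follows directly from Equation \ref{eq:sim}. When both nodes are mapped to the same SNOMED-CT
concept $c$ (assumed here not to be the root), the LCA is the concept itself, so $c_a = c$ and
$\delta(c_1,c_a)=\delta(c_2,c_a)=0$:
\[
sim(c,c)=\frac{\delta(c,root)}{\delta(c,root)+0+0}=1
\]
The maximum score therefore indicates identical mapped concepts rather than a graded degree of relatedness.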


Additionally, Figure \ref{compare1} visualises the highly related nodes in the two
archetypes under comparison. 
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=\textwidth]{../res/compare1}
\end{center}
\caption{The highly related nodes as a result of the comparison according to the similarity scores}
\label{compare1}
\end{figure}
In the figure, two archetype node trees are shown side by side, with different shapes
representing nodes of different reference model classes. Each node contains its description, which
corresponds to the archetype term columns in Table \ref{paper3_result1}. The links between the nodes
denote the similarity scores. It can be observed that the
CODEPHRASE of the node \texttt{Laterality} on the left appears most related to the
CODEPHRASE of the node \texttt{Side} on the right. This conclusion is based on the fact that
the Laterality-Side pair achieved the highest score among three pairs of nodes involving
\texttt{Laterality}. This is intuitive since both nodes express the meaning of laterality. 
%Secondly, the figure suggested that although the \emph{ELEMENT} node
%\textbf{Body site} on the left appears to be linked to both \textbf{Distance from the landmark} and
%\textbf{Identified landmark} on the right, they are indistinguishable by the similarity scores.
%Naturally \textbf{Body site} does not seem related to any of the nodes on the right.
%However, despite the high ambiguity of the node,  
Similarly, it is intuitive from a human perspective that ELEMENT \texttt{Body site
description} on the left is strongly linked to ELEMENT \texttt{Description} on the right.
Interestingly, the diagram seems to suggest that the body
site archetype, although not a mature archetype, could be aligned with, and in fact identified as, a generic version of the anatomical location
precise archetype. The author speculates that the latter extends the \texttt{Laterality}
node into more specific parts.





% add codephrase comparison algorithm in appendix
The results of the comparison between the body site and blood
pressure archetypes are listed in Table \ref{paper3_result2}, with the highest-scoring pairs of nodes
highlighted.
\begin{center}
\footnotesize
\begin{longtable}[!htbp]{l | l | l | l}

  \hline
  \textbf{Blood pressure} & \textbf{Body site} & \textbf{RM class} & \textbf{Score}\\
  \hline
 
Systolic& Body site & ELEMENT & 0.125\\
Systolic& Laterality  & ELEMENT & 0.1\\
Systolic& Body site description  & ELEMENT & 0.1\\
Diastolic& Body site& ELEMENT & 0.125\\
Diastolic& Laterality& ELEMENT & 0.1\\
Diastolic& Body site description& ELEMENT & 0.1\\
Mean Arterial Pressure& Body site& ELEMENT & 0.125\\
Mean Arterial Pressure& Laterality& ELEMENT& 0.1\\
Mean Arterial Pressure& Body site description& ELEMENT& 0.1\\
Pulse Pressure& Body site& ELEMENT & 0.142857142857\\
Pulse Pressure& Laterality& ELEMENT & 0.111111111111\\
Pulse Pressure& Body site description& ELEMENT & 0.111111111111\\
Comment& Body site& ELEMENT & 0.1 \\
Comment& Laterality& ELEMENT & 0.0833333333333\\
Comment& Body site description& ELEMENT & 0.0833333333333\\
Position& Body site& ELEMENT & 0.142857142857\\
Position& Laterality& ELEMENT & \textbf{0.428571428571}\\ 
Position& Body site description& ELEMENT &  0.111111111111 \\
\hline
Standing,Sitting,Reclining, & Left,Right,Bilateral,Unilateral & CodePhrase & 0.0821366133866\\ 
Lying,Lying with tilt to left & left,Unilateral right,Unilateral & & \\
\hline

Confounding factors& Body site & ELEMENT &  0.142857142857 \\
Confounding factors& Laterality& ELEMENT &  0.111111111111 \\
Confounding factors& Body site description& ELEMENT &  0.111111111111 \\
Sleep status& Body site& ELEMENT &  0.0833333333333 \\
Sleep status& Laterality& ELEMENT &  0.0714285714286 \\
Sleep status& Body site description& ELEMENT &  0.0714285714286 \\
\hline
Alert \& awake,Sleeping & Left,Right,Bilateral,Unilateral & CodePhrase &  0.0806485181485 \\
		        & left,Unilateral right,Unilateral & & \\
\hline

Tilt& Body site& ELEMENT &  0.111111111111 \\
Tilt& Laterality& ELEMENT &  0.0909090909091 \\
Tilt& Body site description& ELEMENT &  0.0909090909091 \\
Cuff size& Body site& ELEMENT & 0.125\\
Cuff size& Laterality& ELEMENT & 0.375\\
Cuff size& Body site description& ELEMENT & 0.1\\
\hline
Adult Thigh,Large Adult,Adult,& Left,Right,Bilateral,Unilateral & CodePhrase &  0.120328282828 \\
Small Adult,Paediatric/Child, & left,Unilateral right,Unilateral & & \\
Infant,Neonatal &  & & \\
\hline

Location & Body site & CLUSTER &  0.142857142857 \\
Location of measurement& Body site & ELEMENT &  0.142857142857 \\
Location of measurement& Laterality& ELEMENT &  \textbf{0.428571428571} \\
Location of measurement& Body site description& ELEMENT &  0.111111111111 \\
\hline
Right arm,Left arm,Right thigh, & Left,Right,Bilateral,Unilateral & CodePhrase &  0.101790652158 \\
Left thigh,Right wrist,Left wrist, & left,Unilateral right,Unilateral & & \\
Right ankle,Left ankle,Finger, & & & \\
Toe,Intra-arterial  & & & \\  
\hline
Specific location& Body site& ELEMENT &  0.142857142857 \\
Specific location& Laterality& ELEMENT &  \textbf{0.428571428571} \\
Specific location& Body site description& ELEMENT &  0.111111111111 \\
Method& Body site& ELEMENT &  0.142857142857 \\
Method& Laterality& ELEMENT &  \textbf{0.666666666667} \\
Method& Body site description& ELEMENT &  0.111111111111 \\
\hline
Auscultation,Palpation, & Left,Right,Bilateral,Unilateral & CodePhrase &  0.184508547009 \\
Machine,Invasive & left,Unilateral right,Unilateral & & \\
\hline
Mean Arterial Pressure Formula& Body site& ELEMENT & 0.125\\
Mean Arterial Pressure Formula& Laterality& ELEMENT & 0.1\\
Mean Arterial Pressure Formula& Body site description& ELEMENT & 0.1\\
Diastolic endpoint& Body site& ELEMENT & 0.125\\
Diastolic endpoint& Laterality& ELEMENT & 0.1\\
Diastolic endpoint& Body site description& ELEMENT & 0.1\\
\hline
Phase IV,Phase V & Left,Right,Bilateral,Unilateral & CodePhrase &  0.0780844155844 \\
		 & left,Unilateral right,Unilateral & & \\
  \hline
 		

	 \caption{\label{paper3_result2}Highly related pairs of archetype nodes 
	 between \emph{blood pressure} and \emph{body site} with their similarity scores}
\end{longtable}
\end{center}
This different choice of archetypes produced a result showing that the two are not closely
related. This is expected, because the
two archetypes serve different clinical purposes. Arguably the node \texttt{Laterality} does not 
seem directly related to the nodes that are suggested in the table, although semantically speaking 
\texttt{Location of measurement} and \texttt{Specific location} can be considered relevant to
what the body site archetype expresses. If understood correctly, the \texttt{Location of
measurement} node represents the encoded clinical information that specifies the location of blood pressure
measurement and \texttt{Specific location} a general description of that location. The body site
archetype on the other hand represents the general information of a location with laterality 
on the body on which clinical observation or activity is taking place. Therefore, as a re-usable
component, the body site archetype is related to the nodes that attempt to capture the location of the
blood pressure measurement. However, the open question is why \texttt{Location of measurement},
\texttt{Specific location} and \texttt{Position} have achieved the same similarity score, and why
the seemingly unrelated \texttt{Method} achieved a much higher score than the rest. These
scores will be decomposed and discussed in the discussion section.





\subsection{Discussion}
\label{sec:paper2disc}
The results captured from the comparison not only demonstrate the applicability of terminological
shadows but also open up a new approach to using SNOMED-CT to measure the relatedness and
correlation of two archetypes. 


The first comparison, between the body site and anatomical location precise archetypes, 
has successfully identified some common parts that are semantically similar. As Figure
\ref{compare1} illustrates, the anatomical location precise archetype is more elaborate than the
body site archetype. It can be observed that the main purpose of the body site archetype is to provide
a re-usable component describing a site at which a certain clinical activity is taking place. With
such a general intention, it comprises three ELEMENT nodes that contain generic information
about the body site: 
\begin{itemize}
  \item \textbf{Body site}: a node that contains the name for the site.
  \item \textbf{Body site description}: a description for the site.
  \item \textbf{Laterality}: a node that contains a \emph{CodePhrase} to describe the laterality of
    the site.
\end{itemize}
The result of the comparison intuitively suggests that the CODEPHRASE contained by
\texttt{Laterality} is the closest equivalent to the CODEPHRASE of node \texttt{Side} from the
anatomical location precise archetype. The similarity scores of other CODEPHRASE pairs, such as
Laterality-Aspect and Laterality-Numerical identifier, are lower than Laterality-Side even though the phrase
`lateral to' appears in the \texttt{Aspect} node. As mentioned in the result section, the similarity
score of $1.0$ for the Body site description-Description pair is caused by the same SNOMED-CT concept in their
shadows. However, it is intuitive to believe that these nodes are similar and equally vague. The
algorithm should be improved in the future to address this issue, for example by adding
a string comparison step to verify the equivalence of the two descriptions when both map to the same SNOMED-CT concept. 
The comparison has failed to find a semantically similar node for the ELEMENT node
\texttt{Body site}. It is clear that the body site archetype has a less complex structure than the anatomical location
precise archetype. In the latter archetype, the information about the location is divided into two
categories: \texttt{Relative location} and \texttt{Specific location}. Nevertheless the
variety created by this specialisation has not prevented the terminological shadow approach from finding
similarities between the two archetypes.  

The second archetype comparison, which is between body site and blood pressure, results in four pairs
of nodes that are suggested to be closely related. Figure \ref{compare2} highlights these
pairs of nodes. 
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=\textwidth]{../res/compare2}
\end{center}
\caption{The suggested highly related pairs of nodes from the second comparison 
and their SNOMED-CT concepts}
\label{compare2}
\end{figure}
The figure shows the SNOMED-CT concept to which
each node has been mapped and the calculated similarity score for each pair. Careful readers might
notice that both \texttt{Location of measurement} and \texttt{Specific location} have been mapped to
the same SNOMED-CT concept \emph{Location (attribute): 246267002}. This is because
the term ``measurement'' is the 29$^{th}$ most frequent word in SNOMED-CT: there
are 10847 occurrences of ``measurement'' in a total of 93791 unique terms of SNOMED-CT descriptions. This
makes ``measurement'' insignificant when it is evaluated by the \emph{tf-idf} weighting function,
and therefore both nodes are mapped to the same concept in SNOMED-CT.
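As a rough illustration of this effect, assume the common form of the inverse document frequency,
$\mathrm{idf}(t) = \ln (N / n_t)$, where $N$ is the number of unique terms and $n_t$ is the number of
those terms that contain the word $t$. The exact weighting function used in the implementation may
differ, and treating each occurrence of ``measurement'' as a separate term is a simplification. With
the counts quoted above,
\[
  \mathrm{idf}(\mbox{measurement}) = \ln \frac{93791}{10847} \approx \ln 8.65 \approx 2.16 ,
\]
whereas a word that appears in only one term would receive $\ln 93791 \approx 11.45$. Under these
assumptions ``measurement'' carries roughly one fifth of the weight of a rare word, which is
consistent with it contributing little to the mapping.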
The similarity scores suggest that the \texttt{Laterality} node is equally related to \texttt{Position},
\texttt{Location of measurement} and \texttt{Specific location}, but has a stronger connection to
\texttt{Method}. The counter-intuitive correlation between \texttt{Laterality} and
\texttt{Method} can be explained by the similarity measure and the SNOMED-CT structure. Figure
\ref{compare2a} shows how the scores are derived from the location of these concepts in
SNOMED-CT. As the graph indicates, the green line, which represents the semantic distance between the concepts
\emph{Laterality} and \emph{Method}, is shorter than the distance between
\emph{Laterality} and \emph{Location}/\emph{Position}, which is rendered as red lines. Consequently,
the latter pairs receive a lower similarity score from Equation \ref{eq:sim}.
\begin{figure}[!htbp]
\begin{center}
\includegraphics[width=\textwidth]{../res/compare2a}
\end{center}
\caption{The cause of a higher similarity score in the second comparison}
\label{compare2a}
\end{figure}
This graph shows that concepts in the SNOMED-CT network are not always organised by their semantic
meaning, which means that concepts that are close to each other in the hierarchical structure cannot always
be assumed to be closely related semantically. This type of structure is also observed in other biomedical
ontologies. The purpose of grouping semantically unrelated concepts in medical ontologies varies
between approaches. SNOMED-CT uses a concept model to compose clinical statements by combining
concepts. The \emph{Attribute} category contains concepts to be used as linkage concepts.
It can be broken down further into \emph{Concept model attribute} and \emph{Unapproved
attribute}\footnote{The unapproved attribute category contains concepts that are yet to be confirmed
for use in composition.}. Therefore
the concepts in this sub-graph are not clinically related to each other. The result of the second
comparison shows that this method is not very robust when the distances between concepts in the ontology do
not represent their similarity. However, a customised approach can be tailored
to an ontology such as SNOMED-CT to maximise the benefit of this terminological shadow application.

The design of the archetype comparison application uses the reference model
class type to pair archetype nodes and calculate the similarity score. However,
the archetype comparison algorithm has yet to exploit the structure of
archetypes to enhance the comparison. For example, a multiplier factor
could be introduced to give archetype ``ELEMENT'' leaf nodes more weight when
calculating the similarity score. Section [section] will discuss the planned
improvement for archetype comparison that incorporates archetype structure as
future work.
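Purely as a sketch of how such a multiplier could enter the score, and not as a description of the
current implementation, write the measure of Equation \ref{eq:sim} as $\mathrm{sim}(c_a, c_b)$ for
the SNOMED-CT concepts $c_a$ and $c_b$ in the shadows of two paired nodes $a$ and $b$. A weighted
score could then take the form
\[
  \mathrm{sim}_w(a, b) = w(a)\, w(b)\, \mathrm{sim}(c_a, c_b), \qquad
  w(n) = \left\{ \begin{array}{ll}
    \lambda & \mbox{if $n$ is an ELEMENT leaf node,} \\
    1 & \mbox{otherwise,}
  \end{array} \right.
\]
where $\lambda > 1$ is a hypothetical tuning parameter. The symbols $w$, $\lambda$ and
$\mathrm{sim}_w$ are introduced here for illustration only, and the weighted scores would need to be
rescaled if the similarity measure is expected to remain within a fixed range.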



\section{Summary}
% todo -- CHANGE!!!!!!
This chapter presents two example applications of the terminological shadow approach. The first
application adopted and reapplied a previously published method that automatically maps archetype
terms to SNOMED-CT concepts to generate an overview of SNOMED-CT concept coverage in an archetype
repository. The application utilised what is termed the terminological `Shadow' approach and
effectively identified a number of under-covered and well-covered SNOMED-CT categories in the chosen
archetype repository. On the basis of this experience, the author believes that the `Shadow' method
can be used more generally to identify the coverage of SNOMED-CT concepts, and potentially of other
terminology systems, among clusters of archetypes that are being created. This contribution is
applicable to the management of large archetype repositories, and it will guide archetype developers
to pay attention to the relative size of equivalent terminological categories when assessing the
amount of work required to create terminological bindings for particular archetypes. For example,
if a seemingly significant SNOMED-CT category such as \emph{Microorganism (organism)} is not covered
well but is used extensively in a particular clinical scenario, developers in that area should make
changes in archetypes to cover more concepts in this SNOMED-CT category. One possible solution is to
embed queries to link one archetype term to multiple external references \parencite{openehr2007adl}. Not only
can the `Shadow' be used to monitor the coverage of clinical concepts in an archetype repository, it
can also enhance archetype browsing because all archetype terms are mapped to SNOMED-CT concepts.
Experts who are familiar with the SNOMED-CT classification can use the mapped SNOMED-CT concepts as
labels to traverse archetypes easily without knowledge of the Archetype Definition Language
syntax.


The second shadow application compares archetypes and measures their similarity.
The archetype comparison that has been implemented in this example provides
a means to identify semantically similar parts in different archetypes.
It can be concluded that the comparison function has successfully demonstrated the applicability of
terminological shadows. The study also shows that SNOMED-CT, as a medical ontology, can be used to
measure the semantic similarity of clinical content such as archetypes. Despite the limitations of the
approach in terms of the accuracy of the SNOMED-CT mapping in the terminological shadows and the choice
of the semantic similarity measure, it has the potential to open up a new field for
archetype/SNOMED-CT researchers.

A number of conclusions can be drawn from the result of this application. SNOMED-CT has sufficiently wide
medical domain coverage to be used as a reference ontology to correlate archetypes. The mapped SNOMED-CT
concepts in the terminological shadows are useful for the comparison of archetypes. The semantic
similarity measure chosen in the study has achieved the basic objectives of distinguishing and correlating
semantic content in archetypes, although the second comparison shows that it is less robust where
distances in SNOMED-CT do not reflect semantic similarity. The efficiency of the similarity measure
may need to be evaluated and improved in a more dedicated and rigorous study.



% conclusion: SNOMED-CT hierarchy needs to be taken into consideration, a better RM class
% match making -- e.g don't compare ELE that contains text with ELE contains numbers

% todo -- as research goes on, advanced archetype term-SNO binding methods emerges [ref 2 cite]
% the archetype comparison offers an ideal platform to adopt and utilise these algorithms to 

%\end{document}
